[DWARF] Speedup .gdb_index dumping #151806

itrofimow · 2025-08-02T08:26:22Z

This patch drastically speeds up dumping .gdb_index for large indexes.

llvmbot · 2025-08-02T08:26:54Z

@llvm/pr-subscribers-debuginfo

Author: None (itrofimow)

Changes

This patch drastically speeds up dumping .gdb_index for large indexes.

Full diff: https://github.com/llvm/llvm-project/pull/151806.diff

1 Files Affected:

(modified) llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp (+20-5)

diff --git a/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp b/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp
index 987e63963a068..c0ad2a38df373 100644
--- a/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp
+++ b/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp
@@ -17,6 +17,7 @@
 #include <cinttypes>
 #include <cstdint>
 #include <set>
+#include <unordered_map>
 #include <utility>
 
 using namespace llvm;
@@ -60,6 +61,24 @@ void DWARFGdbIndex::dumpSymbolTable(raw_ostream &OS) const {
                ", filled slots:",
                SymbolTableOffset, (uint64_t)SymbolTable.size())
      << '\n';
+
+  std::unordered_map<uint32_t, decltype(ConstantPoolVectors)::const_iterator>
+      CuVectorMap{};
+  CuVectorMap.reserve(ConstantPoolVectors.size());
+  const auto FindCuVector =
+      [&CuVectorMap, notFound = ConstantPoolVectors.end()](uint32_t vecOffset) {
+        const auto it = CuVectorMap.find(vecOffset);
+        if (it != CuVectorMap.end()) {
+          return it->second;
+        }
+
+        return notFound;
+      };
+  for (auto it = ConstantPoolVectors.begin(); it != ConstantPoolVectors.end();
+       ++it) {
+    CuVectorMap.emplace(it->first, it);
+  }
+
   uint32_t I = -1;
   for (const SymTableEntry &E : SymbolTable) {
     ++I;
@@ -72,11 +91,7 @@ void DWARFGdbIndex::dumpSymbolTable(raw_ostream &OS) const {
     StringRef Name = ConstantPoolStrings.substr(
         ConstantPoolOffset - StringPoolOffset + E.NameOffset);
 
-    auto CuVector = llvm::find_if(
-        ConstantPoolVectors,
-        [&](const std::pair<uint32_t, SmallVector<uint32_t, 0>> &V) {
-          return V.first == E.VecOffset;
-        });
+    auto CuVector = FindCuVector(E.VecOffset);
     assert(CuVector != ConstantPoolVectors.end() && "Invalid symbol table");
     uint32_t CuVectorId = CuVector - ConstantPoolVectors.begin();
     OS << format("      String name: %s, CU vector index: %d\n", Name.data(),

itrofimow · 2025-08-02T08:31:13Z

I have a binary with gdb-index of size ~250Mb, and for that binary llvm-dwarfdump --gdb-index takes basically forever (10+ minutes) to complete.
With the patch applied it takes ~5s
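
To make the cost of the removed `llvm::find_if` call concrete, here is a minimal standalone sketch that counts comparisons for a linear scan done once per symbol table entry. It uses only the standard library; `CuVectors` and `countLinearScanComparisons` are hypothetical names chosen for illustration and are not part of LLVM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for ConstantPoolVectors: (offset, CU indices) pairs.
using CuVectors = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

// Counts how many element comparisons a find_if-style linear scan performs
// when every offset in Lookups is searched for in Vectors.
std::size_t countLinearScanComparisons(const CuVectors &Vectors,
                                       const std::vector<uint32_t> &Lookups) {
  std::size_t Comparisons = 0;
  for (uint32_t Wanted : Lookups) {
    auto It = std::find_if(Vectors.begin(), Vectors.end(),
                           [&](const CuVectors::value_type &V) {
                             ++Comparisons; // one comparison per visited entry
                             return V.first == Wanted;
                           });
    assert(It != Vectors.end() && "offset must be present");
    (void)It; // silence unused-variable warnings when asserts are disabled
  }
  return Comparisons;
}
```

With N distinct offsets, each looked up once, the scan performs N(N+1)/2 comparisons in total (500500 for N = 1000), so the work grows quadratically with the size of the index.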

dwblaikie

Thanks for the improvement!

Hmm, actually at a high level: I guess this ConstantPoolVectors isn't sorted, is it? So we can't do a binary search... could we sort it? I guess not - since we do want to dump it in a way that matches the input too (in case the on-disk ordering is important to debugging the data at some point)?

(oh, and high level question, if you're interested/able: What's your interest in gdb_index? Myself, I've worked on various indexing solutions at Google due to the large size of single binaries we have, for a while but we rarely see traction/interest in these tools outside of Google - so it's always interesting to make friends with folks who are facing similar problems)

dwblaikie · 2025-08-05T16:45:20Z

llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp

+       ++it) {
+    CuVectorMap.emplace(it->first, it);
+  }


Could drop the {} here.

Could use a range-based for loop, and instead of putting iterators as values in the map, use pointers (then you can get a pointer to the value in the range based for loop where there aren't any visible/name-able iterators)

I've rearranged the code a bit to just hold the ids in the map, instead of calculating these ids later
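
The rearrangement described in this thread (a range-based for loop, with ids stored directly in the map) might look roughly like the sketch below. This is an illustration under assumptions, not the code from the revised commit: `CuVectors` and `buildOffsetToId` are hypothetical names, and only the standard library is used.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for ConstantPoolVectors: (offset, CU indices) pairs.
using CuVectors = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

// Builds a map from each CU vector's offset to its position (id) in Vectors,
// so that the id no longer has to be derived from an iterator afterwards.
std::unordered_map<uint32_t, uint32_t>
buildOffsetToId(const CuVectors &Vectors) {
  std::unordered_map<uint32_t, uint32_t> OffsetToId;
  OffsetToId.reserve(Vectors.size());
  uint32_t Id = 0;
  for (const auto &V : Vectors) {
    bool Inserted = OffsetToId.emplace(V.first, Id++).second;
    assert(Inserted && "offsets are expected to be unique");
    (void)Inserted; // silence unused-variable warnings when asserts are disabled
  }
  return OffsetToId;
}
```

A lookup then becomes a single `find` on the map, and a miss (`find` returning `end()`) corresponds to the "Invalid symbol table" case that the dump code asserts on.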

github-actions · 2025-08-06T11:42:27Z

✅ With the latest revision this PR passed the C/C++ code formatter.

itrofimow · 2025-08-06T12:50:34Z

Thanks for the review!

> Hmm, actually at a high level: I guess this ConstantPoolVectors isn't sorted, is it? So we can't do a binary search... could we sort it? I guess not - since we do want to dump it in a way that matches the input too (in case the on-disk ordering is important to debugging the data at some point)?

I think it is sorted by construction, but given that we look for exact match, for big enough vectors it likely would still be faster to put offset->id in a hash map rather than do a binary search.

> oh, and high level question, if you're interested/able: What's your interest in gdb_index?

Actually, I'm not particularly interested in gdb_index itself, although it's always nice to learn how smart people organize their data; I've encountered this inefficiency when debugging BOLT generating gdb-index in ways that break gdb (#151857, #151861).

We at Yandex are also dealing with somewhat large binaries, and even though our toolchain is mostly llvm, gdb is the go-to debugger for probably historical reasons, and there is a lot of infrastructure built on top of it here and there (coredump analysis, some custom poor man's profilers, pretty-printers etc.). At Perforator we are working to fully support generating BOLT-profiles, and thereafter incorporate BOLT into release pipelines; these BOLT issues manifested themselves, and I went digging

> Myself, I've worked on various indexing solutions at Google due to the large size of single binaries we have

Although I'm not particularly interested in gdb_index, I am very much interested in another index format: GSYM. Given your vast experience with different indexes, maybe you know places/folks to ask questions/request reviews about it?

itrofimow · 2025-08-06T12:51:53Z

CI failure is the CodeGen/AArch64/midpoint-int.ll test, which seems completely unrelated to me

dwblaikie · 2025-08-06T17:54:02Z

> > Myself, I've worked on various indexing solutions at Google due to the large size of single binaries we have
>
> Although I'm not particularly interested in gdb_index, I am very much interested in another index format: GSYM. Given your vast experience with different indexes, maybe you know places/folks to ask questions/request reviews about it?

Yeah, that's definitely @clayborg's wheelhouse.

(thanks for the other context on your use cases)

> > Hmm, actually at a high level: I guess this ConstantPoolVectors isn't sorted, is it? So we can't do a binary search... could we sort it? I guess not - since we do want to dump it in a way that matches the input too (in case the on-disk ordering is important to debugging the data at some point)?
>
> I think it is sorted by construction, but given that we look for exact match, for big enough vectors it likely would still be faster to put offset->id in a hash map rather than do a binary search.

Hmm - is it? It looked like the offsets were read in from the file in the SymTableSize loop in parseImpl - doesn't look like that's necessarily ordered...

itrofimow · 2025-08-06T18:40:04Z

Thanks for the directions! I'll reach out to Greg about GSYM

> Hmm - is it? It looked like the offsets were read in from the file in the SymTableSize loop in parseImpl - doesn't look like that's necessarily ordered...

The CUOffsets are read in here into a set

llvm-project/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp

Lines 170 to 177 in 4d64fcd

    
    std::set<uint32_t> CUOffsets;
    for (uint32_t i = 0; i < SymTableSize; ++i) {
      uint32_t NameOffset = Data.getU32(&Offset);
      uint32_t CuVecOffset = Data.getU32(&Offset);
      SymbolTable.push_back({NameOffset, CuVecOffset});
      if (NameOffset || CuVecOffset)
        CUOffsets.insert(CuVecOffset);
    }

and later used to populate ConstantPoolVectors like this

llvm-project/llvm/lib/DebugInfo/DWARF/DWARFGdbIndex.cpp

Lines 182 to 186 in 4d64fcd

    
    for (auto CUOffset : CUOffsets) {
      Offset = ConstantPoolOffset + CUOffset;
      ConstantPoolVectors.emplace_back(0, SmallVector<uint32_t, 0>());
      auto &Vec = ConstantPoolVectors.back();
      Vec.first = Offset - ConstantPoolOffset;

so I think they are indeed ordered, but I still would prefer a hash-map here :)
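
The "sorted by construction" argument rests on `std::set` iterating its keys in ascending order. A minimal standalone sketch of that property follows, assuming a simplified stand-in for the parse step; `CuVectors`, `isSortedByOffset`, and `buildFromOffsets` are hypothetical names, not LLVM functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for ConstantPoolVectors: (offset, CU indices) pairs.
using CuVectors = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

// Returns true if the entries are in non-decreasing offset order.
bool isSortedByOffset(const CuVectors &Vectors) {
  return std::is_sorted(Vectors.begin(), Vectors.end(),
                        [](const CuVectors::value_type &A,
                           const CuVectors::value_type &B) {
                          return A.first < B.first;
                        });
}

// Mimics the two quoted loops: offsets are first collected into a std::set
// (which deduplicates and orders them), then one entry is appended per offset.
CuVectors buildFromOffsets(const std::vector<uint32_t> &RawOffsets) {
  std::set<uint32_t> Unique(RawOffsets.begin(), RawOffsets.end());
  CuVectors Vectors;
  Vectors.reserve(Unique.size());
  for (uint32_t Offset : Unique)
    Vectors.emplace_back(Offset, std::vector<uint32_t>());
  assert(isSortedByOffset(Vectors) && "std::set iterates in ascending order");
  return Vectors;
}
```

Whatever order the offsets appear in on disk, the resulting vector is ordered by offset and free of duplicates, which is the precondition a binary search needs.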

itrofimow · 2025-08-06T19:21:57Z

The CI is green after rebase; could you @dwblaikie please merge this PR for me? - I don't have write access

dwblaikie · 2025-08-06T20:36:42Z

Ah, right - didn't notice it was reading into a set.

For argument's sake - could you check the performance with a binary search? See if it's acceptable for your use case. Be nice not to have to build another data structure.

itrofimow · 2025-08-06T21:07:20Z

> For argument's sake - could you check the performance with a binary search? See if it's acceptable for your use case. Be nice not to have to build another data structure.

With the hash table, user time averages 3.73s
With the binary search, user time averages 4.21s

Both are perfectly fine for me, your call

dwblaikie · 2025-08-06T22:06:54Z

Really appreciate your willingness to give it a go. With that kind of tradeoff I'd prefer the binary search - especially if the implementation comes out pretty tidy/small - hopefully it can use llvm::binary_search or similar
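
A binary-search lookup over the offset-sorted vector could be sketched as follows. It uses `std::lower_bound` so that it compiles without LLVM headers; it is not necessarily what the merged commit does, and `CuVectors`, `findCuVector`, and `cuVectorId` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for ConstantPoolVectors: (offset, CU indices) pairs,
// assumed to be sorted by offset with no duplicates.
using CuVectors = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

// Returns an iterator to the entry whose offset equals VecOffset,
// or Vectors.end() if there is no such entry.
CuVectors::const_iterator findCuVector(const CuVectors &Vectors,
                                       uint32_t VecOffset) {
  auto It = std::lower_bound(
      Vectors.begin(), Vectors.end(), VecOffset,
      [](const CuVectors::value_type &V, uint32_t Offset) {
        return V.first < Offset;
      });
  // lower_bound yields the first entry with offset >= VecOffset, so an
  // exact-match check is still required.
  if (It != Vectors.end() && It->first == VecOffset)
    return It;
  return Vectors.end();
}

// Converts a successful lookup into the index printed as "CU vector index".
uint32_t cuVectorId(const CuVectors &Vectors, uint32_t VecOffset) {
  auto It = findCuVector(Vectors, VecOffset);
  assert(It != Vectors.end() && "Invalid symbol table");
  return static_cast<uint32_t>(It - Vectors.begin());
}
```

Compared with the hash map, this needs no auxiliary data structure; each lookup costs O(log N) comparisons instead of expected O(1), which matches the small timing difference reported in this thread.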

[DWARF] Speedup .gdb_index dumping

0d22a6f

llvmbot added the debuginfo label Aug 2, 2025

dwblaikie reviewed Aug 5, 2025

cr fixes

4d30a5c

itrofimow added 3 commits August 6, 2025 14:59

cleanup

72ff5d8

clang-format fixes

9601b2f

clang-format fixes

b51c223

itrofimow requested a review from dwblaikie August 6, 2025 12:52

dwblaikie approved these changes Aug 6, 2025

Merge branch 'main' into llvm_dwarfdump_gdb_index_speedup

ab33820

switch the implementation to binary search instead

6abe831

dwblaikie approved these changes Aug 6, 2025

dwblaikie merged commit 069bf18 into llvm:main Aug 7, 2025
9 checks passed
